Tobacco use remains one of the leading preventable causes of illness and premature death worldwide. Initiating tobacco use during adolescence significantly raises the risk of addiction and long-term health consequences. Despite widespread public health campaigns, millions of youth begin smoking by age 13–15.
The Global Youth Tobacco Survey (GYTS) offers about two decades (1999–2018) of internationally standardized data on tobacco use behaviors in youth. This dataset includes 30,000+ records across countries, capturing regional, demographic, and policy-related information. We aim to:
- Inform stakeholders (e.g., WHO, CDC) by evaluating which MPOWER strategies are most effective
- Support prevention strategies through evidence-based targeting of at-risk populations
- Encourage early intervention and strategic resource allocation, especially in lower-income regions
This project investigates global trends in youth tobacco use and identifies the most impactful predictors and regional risks using the Global Youth Tobacco Survey (GYTS). Our main objective is to model youth tobacco prevalence and classify survey sites as high-risk (prevalence ≥ 15%) using statistical learning. We implemented a multi-method approach using regression and classification techniques to support actionable public health insights. We found strong evidence that youth tobacco use is declining globally, especially in regions with robust public health policies. However, disparities persist by region, sex, and media exposure. Our models, particularly logistic regression, LDA, and lasso, offer interpretable and reasonably accurate predictions, and can support prioritization efforts for prevention policies and resource allocation. This analysis not only reveals the critical factors behind youth smoking but also provides a data-driven roadmap for improving global tobacco control efforts.
Youth tobacco use remains a global public health challenge. Leveraging the Global Youth Tobacco Survey, we applied a suite of statistical learning models to identify key predictors and classify high-risk regions. Our analysis reveals significant declines in youth tobacco use over time, but notable disparities persist across regions, genders, and policy environments. Logistic regression and LDA reached roughly 79% classification accuracy, and ridge and lasso regression helped limit overfitting. These insights can guide targeted policy interventions and monitoring efforts worldwide.
Source: CDC & WHO GYTS database (1999–2018)
Structure: Panel data with country, region, and demographic details (n = 30,535)
Challenges: Missing values in Data_Value, factor encoding of categorical variables, variable inconsistency across years.
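As a minimal sketch of how the missing-value challenge and the high-risk definition (prevalence ≥ 15%) could be handled together, the snippet below drops rows without a usable Data_Value and attaches a boolean label. The helper names, the sample rows, and the choice of listwise deletion are illustrative assumptions, not the procedure used in this analysis.

```python
HIGH_RISK_THRESHOLD = 15.0  # percent; high-risk cutoff stated in this report (>= 15%)

def parse_prevalence(raw):
    """Return Data_Value as a float, or None when it is missing or unparseable."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text == "" or text.upper() in {"NA", "NAN"}:
        return None
    try:
        return float(text)
    except ValueError:
        return None

def label_high_risk(records):
    """Drop rows without a usable Data_Value and add a boolean high_risk flag."""
    labeled = []
    for row in records:
        value = parse_prevalence(row.get("Data_Value"))
        if value is None:
            continue  # listwise deletion: an assumption made for this sketch
        labeled.append({**row, "Data_Value": value,
                        "high_risk": value >= HIGH_RISK_THRESHOLD})
    return labeled

# Hypothetical rows for illustration only (not actual GYTS records).
sample = [
    {"WHO_Region": "EUR", "Year": 2010, "Sex": "Male", "Data_Value": "18.2"},
    {"WHO_Region": "AFR", "Year": 2012, "Sex": "Female", "Data_Value": ""},
    {"WHO_Region": "SEAR", "Year": 2015, "Sex": "Female", "Data_Value": "15.0"},
    {"WHO_Region": "WPR", "Year": 2016, "Sex": "Male", "Data_Value": "NA"},
    {"WHO_Region": "AMR", "Year": 2017, "Sex": "Total", "Data_Value": 9.4},
]
cleaned = label_high_risk(sample)
```

Dropping incomplete rows is the simplest option; imputation would keep more records but adds modeling assumptions of its own.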
The decline in tobacco use among youth is most prominent in regions with consistent implementation of public health measures, particularly MPOWER policies. Our models identify survey year as a key predictor of decreasing prevalence, which is consistent with global efforts to combat youth smoking. However, the persistently higher prevalence in the European and Western Pacific regions signals regional disparities in policy enforcement and education.
Gender differences were clear: females consistently reported lower tobacco use, aligning with historical trends but also highlighting potential gaps in gender-targeted interventions. Survey topics related to advertising exposure or anti-smoking messaging had strong predictive power, emphasizing the impact of media on youth behavior.
From a policy standpoint, classification models like LDA and logistic regression allowed us to isolate high-risk countries, providing a clear and statistically sound framework for global tobacco control agencies to prioritize interventions.
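To make the classification step concrete, here is a small, self-contained sketch of logistic regression fitted by batch gradient descent, followed by a confusion matrix. It is an illustration under stated assumptions: the toy data, the single centered-year feature, the learning rate, and all function names are hypothetical, and the original analysis most likely relied on a statistical package rather than a hand-written optimizer.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def linear_score(w, x):
    """Intercept w[0] plus the weighted sum of the features."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights (intercept first) by batch gradient descent on the log-loss."""
    n = len(X)
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(linear_score(w, xi)) - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

def predict(w, X, cutoff=0.5):
    """Label a site high-risk (1) when its predicted probability reaches the cutoff."""
    return [int(sigmoid(linear_score(w, xi)) >= cutoff) for xi in X]

def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives, with high-risk as the positive class."""
    cells = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        cells[key] += 1
    return cells

# Hypothetical toy data: survey year (centered and scaled) and a high-risk label.
years = [1999, 2001, 2003, 2005, 2012, 2014, 2016, 2018]
X = [[(yr - 2008.5) / 19.0] for yr in years]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # earlier surveys flagged high-risk in this toy set

weights = fit_logistic(X, y)
y_hat = predict(weights, X)
cm = confusion_matrix(y, y_hat)
accuracy = (cm["tp"] + cm["tn"]) / len(y)
```

On real data the feature vector would also carry encoded region, sex, and topic indicators, and accuracy should be measured on held-out sites rather than on the training rows.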
Ridge and lasso regression models further validated our selection of key variables while reducing the risk of overfitting. These methods also clarified that while some variables (like sample size or specific indicators) added noise, others, like topic and region, consistently enhanced predictive accuracy.
- Cross-validation: 10-fold CV for tuning (e.g., λ in lasso, k in KNN)
- Bootstrap: Validated coefficient stability in regression
- Confusion Matrices, ROC Curves: Evaluated classification performance
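The tuning step can be sketched as follows: 10-fold cross-validation is used to compare several values of k for a KNN classifier, and the value with the best held-out accuracy is kept. The same loop would apply to choosing λ for lasso. The synthetic data, the single numeric predictor, the candidate grid, and the function names are assumptions made for illustration only.

```python
import random

def k_fold_indices(n, n_folds=10, seed=0):
    """Shuffle 0..n-1 once and deal the indices into n_folds near-equal folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::n_folds] for i in range(n_folds)]

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training points (1-D distance)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return int(2 * votes > k)  # ties go to class 0

def cv_accuracy(xs, ys, k, folds):
    """Average held-out accuracy: each fold is predicted from the other folds."""
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train_x = [xs[i] for i in range(len(xs)) if i not in held]
        train_y = [ys[i] for i in range(len(xs)) if i not in held]
        correct += sum(knn_predict(train_x, train_y, xs[i], k) == ys[i]
                       for i in held_out)
    return correct / len(xs)

# Synthetic data for illustration only: one numeric predictor and a noisy binary label.
rng = random.Random(42)
xs = [rng.uniform(0, 30) for _ in range(60)]
ys = [int(x + rng.gauss(0, 4) >= 15) for x in xs]

folds = k_fold_indices(len(xs), n_folds=10)
scores = {k: cv_accuracy(xs, ys, k, folds) for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)
```

Reusing the same folds for every candidate keeps the comparison fair, since each value of k is scored on identical held-out sets.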
To support our core questions—identifying influential predictors and classifying high-risk countries—we visualized model predictions and variable importance across several statistical learning methods. These plots help reinforce the observed patterns and model decisions.
Stakeholders: WHO, CDC, Ministries of Health, educators, parents, youth-focused NGOs.
Ethical Notes:
- Data involves minors: privacy and responsible classification are essential
- Avoid stigmatizing high-risk countries or demographics
- Use results for policy guidance, not punitive action
Our findings confirm that statistical learning can effectively uncover nuanced global patterns in youth tobacco use and guide public health strategies. Key insights include:
- A clear global decline in youth tobacco use, strongest in regions implementing MPOWER policies.
- Significant gender disparity, with females consistently reporting lower tobacco use.
- The strongest predictors across all models were Year, WHO_Region, Topic, and Sex.
- Public health policy presence and exposure to pro- or anti-tobacco content significantly influenced youth behavior.
- Model accuracy: Logistic Regression and LDA achieved ≈79%, KNN at 65%, and Ridge/Lasso regression improved prediction reliability by reducing overfitting.
- Bootstrap and cross-validation techniques confirmed model robustness and generalizability.
These results provide a foundation for policy targeting and long-term global monitoring efforts.
Based on our findings, we propose the following:
- [...] and monitoring in regions with persistently high usage
- Use classification models (e.g., logistic regression, LDA) to flag high-risk zones for intervention
By implementing these strategies, public health organizations can not only curb youth tobacco use more effectively but also improve the equity and precision of their tobacco control programs.